p2p: make dial faster by streamlined discovery process #31678

Closed
cskiraly wants to merge 2 commits into ethereum:master from cskiraly:dial-turbo

Conversation

@cskiraly
Contributor

@cskiraly cskiraly commented Apr 19, 2025

This PR improves the speed of Disc/v4 and Disc/v5 based discovery. It builds on #31592. To be rebased after merging that PR.

Our dial process is rate-limited, but until now, during startup, our discovery was too slow to serve dial. So the bottleneck was discovery, not the rate limit in dial, resulting in slow ramp-up of outgoing peer count.

This PR makes discovery fast enough to serve dial's needs. Dial's rate limit still throttles the discovery process, so we are not risking being too aggressive in discovery.

The PR adds:

  • a pre-fetch buffer to discovery sources, eliminating slowdowns due to timeouts and the rate mismatch between the two processes
  • multiple parallel lookups to make sure enough dial candidates are available all the time
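The pre-fetch idea from the first bullet can be sketched as an iterator that reads ahead from a slow discovery source into a bounded buffer, so the consumer (dial) is not stalled by per-node discovery latency. This is only an illustrative sketch: `nodeIterator` loosely mirrors the shape of go-ethereum's `enode.Iterator`, but all names here are hypothetical, and node IDs are represented as strings for simplicity.

```go
package main

import "fmt"

// nodeIterator loosely mirrors go-ethereum's enode.Iterator shape.
// The names here are illustrative, not the actual geth API.
type nodeIterator interface {
	Next() bool
	Node() string // geth yields *enode.Node; a string stands in here
}

// sliceIterator is a stand-in discovery source.
type sliceIterator struct {
	nodes []string
	cur   string
}

func (s *sliceIterator) Next() bool {
	if len(s.nodes) == 0 {
		return false
	}
	s.cur, s.nodes = s.nodes[0], s.nodes[1:]
	return true
}

func (s *sliceIterator) Node() string { return s.cur }

// bufferedIterator pre-fetches from the source into a channel in the
// background; the consumer only blocks when the buffer runs dry.
type bufferedIterator struct {
	ch  chan string
	cur string
}

func newBufferedIterator(src nodeIterator, size int) *bufferedIterator {
	b := &bufferedIterator{ch: make(chan string, size)}
	go func() {
		defer close(b.ch)
		for src.Next() {
			b.ch <- src.Node() // blocks only when the buffer is full
		}
	}()
	return b
}

func (b *bufferedIterator) Next() bool {
	n, ok := <-b.ch
	b.cur = n
	return ok
}

func (b *bufferedIterator) Node() string { return b.cur }

func main() {
	src := &sliceIterator{nodes: []string{"n1", "n2", "n3"}}
	it := newBufferedIterator(src, 2)
	for it.Next() {
		fmt.Println(it.Node())
	}
}
```

Because the background goroutine keeps pulling whenever the buffer has room, discovery timeouts are absorbed by the buffer instead of directly delaying dial.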

@cskiraly
Contributor Author

First hour of dial progress before this PR: number of outgoing peers
[image]

First hour of dial progress after this PR (and #31592): number of outgoing peers
[image]

@cskiraly
Contributor Author

Discovery traffic before:
[image]

And after:
[image]

This PR increases the initial UDP traffic, mostly caused by the pre-fetch buffer. We might dial that back a bit, although I don't think the amount of traffic generated (~70 KB/s) is too much. In the long term, average discovery traffic should be the same.

@cskiraly cskiraly force-pushed the dial-turbo branch 2 times, most recently from a5b1b9d to 016bf44 Compare April 24, 2025 20:32
@cskiraly
Contributor Author

@fjl I was thinking about the increased UDP traffic. There is the case of a very small network, smaller than maxpeers. For example a small test network with only a few nodes. In that case discovery might go full-steam continuously, looking for new dial candidates. In this specific case this PR would increase the background UDP traffic by roughly 3x.

If we think this is a problem, we might use discoveryParallelLookups=2 to be more conservative. Or should we maybe introduce an explicit "smallnet" flag and tune parameters accordingly? Or can we just expect people to set maxpeers wisely for a small network? I would say the last option would be fine, but we need some mods to make that work. See below.

A short description of what is supposed to happen, in detail. There are two cases:

  • a small net using the large public disc-v4/v5 DHT.
  • a small net using a private disc-v4/v5 DHT.
    Before #31592 (p2p: Filter Discv4 dial candidates based on forkID) these would behave differently, but after adding the forkID filter on discv4, both should behave the same, except for the number of peers discarded during discovery.

In discovery, we only have duplicate filtering within a lookup, via the seen cache mechanism. However, the lookupIterator starts new lookups, and these can return the same nodes again. There is no duplicate filtering between sequential lookups, or between the parallel lookups introduced here by discoveryParallelLookups. So discovery can go on at full steam, feeding "new" dial candidates that are in fact the same as the old ones.

Dial pulls from this and has its own duplicate detection through dialHistoryExpiration. When it finds a duplicate within the expiration time, it throws it away and pulls more candidates from discovery. It only throttles down as it gets close to maxpeers, but that will not happen if maxpeers is significantly larger than the network size.

If instead maxpeers is set correctly, dial will try to slow down by reducing slots. But even with a single slot, it will pull and discard in a loop, keeping discovery at full steam. Should we introduce a time-based mechanism here, e.g. throttle when checkDial fails too many times in a row?

@fjl fjl self-assigned this Apr 29, 2025
@cskiraly
Contributor Author

Testing on a small network (Sepolia), this is the first hour of operation before this PR:
[image]
And after this PR:
[image]

@cskiraly
Contributor Author

cskiraly commented Jun 1, 2025

Separated from #31944. This PR now only contains the parallelized lookup part.

@fjl
Contributor

fjl commented Jun 17, 2025

I think we have all of this in already.

@fjl fjl closed this Jun 17, 2025
@cskiraly
Contributor Author

Actually the parallelised lookup part is not yet merged. I'm reopening this to rebase what's still missing.

@cskiraly cskiraly reopened this Aug 15, 2025
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
Signed-off-by: Csaba Kiraly <csaba.kiraly@gmail.com>
@cskiraly
Contributor Author

We have discussed implementing this in a different way. Closing.

@cskiraly cskiraly closed this Dec 22, 2025